{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "afd55886-5f5b-4794-838e-ef8179fb0394",
   "metadata": {},
   "source": [
    "##### **** These pip installs need to be adapted to the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:\n",
    "##### \n",
    "\n",
    "```\n",
    "make venv \n",
    "source venv/bin/activate \n",
    "pip install jupyterlab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "!pip install data-prep-toolkit\n",
    "!pip install data-prep-toolkit-transforms\n",
    "!pip install data-prep-connector"
   ]
  },
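   {
    "cell_type": "markdown",
    "id": "d1e7a9c4-3b52-4f0e-9a61-7c2f5b8e4a10",
    "metadata": {},
    "source": [
     "##### **** Sketch: check which release is installed\n",
     "\n",
     "Because the installs above must match the appropriate release level, it can help to confirm what actually ended up in the environment. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the toolkit, and the distribution names are the ones from the pip commands above.\n",
     "\n",
     "```python\n",
     "from importlib.metadata import PackageNotFoundError, version\n",
     "\n",
     "# Distribution names taken from the pip install cell above.\n",
     "PACKAGES = ['data-prep-toolkit', 'data-prep-toolkit-transforms', 'data-prep-connector']\n",
     "\n",
     "\n",
     "def installed_versions(names):\n",
     "    '''Map each distribution name to its installed version, or None if missing.'''\n",
     "    found = {}\n",
     "    for name in names:\n",
     "        try:\n",
     "            found[name] = version(name)\n",
     "        except PackageNotFoundError:\n",
     "            found[name] = None\n",
     "    return found\n",
     "\n",
     "\n",
     "for name, ver in installed_versions(PACKAGES).items():\n",
     "    status = ver if ver else 'not installed'\n",
     "    print(f'{name}: {status}')\n",
     "```\n",
     "\n",
     "A value of `None` means the distribution is not installed in the environment that runs the kernel."
    ]
   },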
  {
   "cell_type": "markdown",
   "id": "614f0633-ad65-4994-9d61-0c21986ca3eb",
   "metadata": {},
   "source": [
    "##### **** Note: nested asynchronous I/O must be enabled in a notebook because the crawler uses coroutines to speed up acquisition and downloads\n",
    "#####\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b6c89ac7-6824-4d99-8120-7d5b150bd683",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "nest_asyncio.apply()"
   ]
  },
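   {
    "cell_type": "markdown",
    "id": "8b3f6e21-5c9d-4a7b-b0e4-2f1d9c6a7e35",
    "metadata": {},
    "source": [
     "##### **** Sketch: why nested event loops need to be enabled\n",
     "\n",
     "A notebook kernel already runs an event loop, so code that tries to start another one fails unless nesting is enabled. The sketch below reproduces that situation with the standard library only and is meant to be run as a plain Python script outside Jupyter; `fetch_stub` and `notebook_like_context` are hypothetical stand-ins and not part of the crawler.\n",
     "\n",
     "```python\n",
     "import asyncio\n",
     "\n",
     "\n",
     "async def fetch_stub(value):\n",
     "    # Stand-in for a crawler coroutine (hypothetical; does no network I/O).\n",
     "    await asyncio.sleep(0)\n",
     "    return value\n",
     "\n",
     "\n",
     "async def notebook_like_context():\n",
     "    # Inside this coroutine an event loop is already running, as in a notebook cell.\n",
     "    pending = fetch_stub(42)\n",
     "    try:\n",
     "        asyncio.run(pending)  # attempt to start a second, nested loop\n",
     "    except RuntimeError as exc:\n",
     "        pending.close()  # discard the coroutine that was never started\n",
     "        return str(exc)\n",
     "    return None\n",
     "\n",
     "\n",
     "message = asyncio.run(notebook_like_context())\n",
     "print(message)\n",
     "```\n",
     "\n",
     "Run as a script, this prints the `RuntimeError` message raised by the nested `asyncio.run` call. `nest_asyncio.apply()` patches asyncio so that such nested calls are permitted."
    ]
   },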
  {
   "cell_type": "markdown",
   "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
   "metadata": {},
   "source": [
    "##### **** Configure the crawler parameters and invoke the transform function\n",
    "##### \n",
    "| parameter: type | Description |\n",
    "| --- | --- |\n",
    "| urls: list | list of seed URLs, e.g. ['https://thealliance.ai'] or ['www.ibm.com/docs','www.ibm.com/help']. The list can include any number of valid URLs |\n",
    "| depth: int | controls the crawling depth |\n",
    "| downloads: int | number of downloads that are stored in the download folder |\n",
    "| folder: str | folder where the downloaded files are stored |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "from dpk_web2parquet.transform import Web2Parquet\n",
    "\n",
    "# Crawl from the seed URL and store the downloaded files in the 'downloads' folder\n",
    "Web2Parquet(urls=['https://thealliance.ai/'],\n",
    "            depth=2,\n",
    "            downloads=10,\n",
    "            folder='downloads').transform()\n"
   ]
  },
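   {
    "cell_type": "markdown",
    "id": "f4a2c7d9-1e68-4b3c-8d05-9a7b3e2c6f41",
    "metadata": {},
    "source": [
     "##### **** Sketch: sanity-check the crawler parameters\n",
     "\n",
     "The parameters described in the table above can be collected and sanity-checked before the crawler is started. This is only a sketch: `build_crawl_config` is a hypothetical helper, and its defaults and validation rules are assumptions made for illustration, not behavior of `Web2Parquet`.\n",
     "\n",
     "```python\n",
     "def build_crawl_config(urls, depth=1, downloads=10, folder='downloads'):\n",
     "    '''Validate crawler arguments and return them as keyword arguments.\n",
     "\n",
     "    The defaults and checks here are illustrative assumptions.\n",
     "    '''\n",
     "    if not isinstance(urls, list) or not urls:\n",
     "        raise ValueError('urls must be a non-empty list')\n",
     "    if not all(isinstance(u, str) and u.strip() for u in urls):\n",
     "        raise ValueError('every seed URL must be a non-empty string')\n",
     "    for name, value in (('depth', depth), ('downloads', downloads)):\n",
     "        # bool is a subclass of int, so reject it explicitly\n",
     "        if isinstance(value, bool) or not isinstance(value, int) or value < 0:\n",
     "            raise ValueError(f'{name} must be a non-negative int')\n",
     "    if not isinstance(folder, str) or not folder:\n",
     "        raise ValueError('folder must be a non-empty string')\n",
     "    return {'urls': urls, 'depth': depth, 'downloads': downloads, 'folder': folder}\n",
     "\n",
     "\n",
     "config = build_crawl_config(['https://thealliance.ai/'], depth=2, downloads=10, folder='downloads')\n",
     "print(config)\n",
     "```\n",
     "\n",
     "The resulting dictionary can be passed on as `Web2Parquet(**config).transform()`, which is equivalent to the keyword call used in this notebook."
    ]
   },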
  {
   "cell_type": "markdown",
   "id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
   "metadata": {},
   "source": [
    "##### **** The specified folder contains the downloaded files. Each file name is the full URL with every / replaced by an _, and the file extension is based on the returned content-type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7276fe84-6512-4605-ab65-747351e13a7c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['downloads/thealliance_ai_core-projects-ntia_request_text.html',\n",
       " 'downloads/thealliance_ai_focus-areas-advocacy_text.html',\n",
       " 'downloads/thealliance_ai_blog-open-source-ai-demo-night-sf-2024_text.html',\n",
       " 'downloads/thealliance_ai_contact_text.html',\n",
       " 'downloads/thealliance_ai_core-projects-sb1047_text.html',\n",
       " 'downloads/thealliance_ai_focus-areas-foundation-models-datasets_text.html',\n",
       " 'downloads/thealliance_ai_focus-areas-hardware-enablement_text.html',\n",
       " 'downloads/thealliance_ai_core-projects-trusted-evals_text.html',\n",
       " 'downloads/thealliance_ai__text.html',\n",
       " 'downloads/thealliance_ai_contribute_text.html',\n",
       " 'downloads/thealliance_ai_community_text.html',\n",
       " 'downloads/thealliance_ai_become-a-collaborator_text.html']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import glob\n",
    "glob.glob(\"downloads/*\")"
   ]
  },
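   {
    "cell_type": "markdown",
    "id": "2c9e5b70-7a14-4d8f-a3b6-6e0f1d4c8b92",
    "metadata": {},
    "source": [
     "##### **** Sketch: summarize the download folder\n",
     "\n",
     "Beyond listing the files, a quick summary of the download folder can confirm how much was fetched. This is a minimal sketch using only the standard library; `summarize_downloads` is a hypothetical helper, and the `downloads` folder name is the one passed to the crawler in this notebook.\n",
     "\n",
     "```python\n",
     "from collections import Counter\n",
     "from pathlib import Path\n",
     "\n",
     "\n",
     "def summarize_downloads(folder='downloads'):\n",
     "    '''Return (file count, total bytes, Counter of file suffixes) for a folder.'''\n",
     "    root = Path(folder)\n",
     "    if not root.is_dir():\n",
     "        # Nothing downloaded yet, or the folder name is wrong.\n",
     "        return 0, 0, Counter()\n",
     "    files = [p for p in root.glob('*') if p.is_file()]\n",
     "    total_bytes = sum(p.stat().st_size for p in files)\n",
     "    suffixes = Counter(p.suffix or '(none)' for p in files)\n",
     "    return len(files), total_bytes, suffixes\n",
     "\n",
     "\n",
     "count, size, by_suffix = summarize_downloads('downloads')\n",
     "print(f'{count} files, {size} bytes')\n",
     "print(dict(by_suffix))\n",
     "```\n",
     "\n",
     "Grouping by suffix gives a rough view of the content types that were returned, since the file extension is derived from the content-type."
    ]
   },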
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fef6667e-71ed-4054-9382-55c6bb3fda70",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
